feat(compute): add EC2 fleet compute strategy#31

Closed
MichaelWalker-git wants to merge 16 commits into main from feat/compute-strategy

Conversation

@MichaelWalker-git
Contributor

Summary

  • Add EC2 fleet compute strategy with SSM Run Command dispatch — a third compute backend alongside AgentCore (default) and ECS Fargate
  • New Ec2ComputeStrategy handler: finds idle instances via tags, uploads payload to S3, dispatches via SSM AWS-RunShellScript, polls GetCommandInvocation, cancels with CancelCommand
  • New Ec2AgentFleet CDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling images
  • Wire orchestrator polling, cancel-task SSM dispatch, and task-api SSM permissions for EC2
  • Stack wiring is commented-out (same pattern as ECS) — ready to enable per-repo via blueprint compute_type: 'ec2'
  • Add instance_type field to RepoConfig and BlueprintConfig for future GPU/custom instance type support
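
For orientation, the handler flow in the first two bullets can be sketched as pure dispatch-planning logic. This is an illustrative sketch, not the repo's actual code: `planDispatch` and the `payloads/` key prefix are assumed names; only the `bgagent:*` tag keys come from the PR itself.

```typescript
// Hypothetical sketch of the EC2 dispatch plan: claim an idle instance via
// tags, upload the payload to S3, then hand shell commands to SSM.
interface Ec2DispatchPlan {
  /** S3 key the orchestrator payload is uploaded under. */
  payloadKey: string;
  /** Tags marking the chosen instance as claimed. */
  busyTags: Record<string, string>;
  /** Shell lines handed to SSM AWS-RunShellScript. */
  commands: string[];
}

function planDispatch(taskId: string, payloadBucket: string): Ec2DispatchPlan {
  const payloadKey = `payloads/${taskId}.json`;
  return {
    payloadKey,
    busyTags: { 'bgagent:status': 'busy', 'bgagent:task-id': taskId },
    commands: [
      `aws s3 cp "s3://${payloadBucket}/${payloadKey}" /tmp/payload.json`,
      'docker run --rm -v /tmp/payload.json:/tmp/payload.json:ro agent:latest',
    ],
  };
}
```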

Test plan

  • mise //cdk:compile — no TypeScript errors
  • mise //cdk:test — 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)
  • mise //cdk:synth — synthesizes without errors (EC2 block commented out)
  • mise //cdk:build — full build including lint passes
  • Deploy with EC2 block uncommented and run an end-to-end task with compute_type: 'ec2'

MichaelWalker-git and others added 16 commits April 9, 2026 13:30
…AgentCore logic

Introduce ComputeStrategy interface with SessionHandle/SessionStatus types
and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from
orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern
for session lifecycle (start/poll/stop). Pure refactor — no behavior change,
identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases.
Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr
rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated
GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for
security scanning tasks, not for the build/test/synth pipeline. Disable
them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise
tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable
  trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from
  orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
   strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
   client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
   and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
   (info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
   sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
   happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
   'running' status — clear signal for future developers
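
Item 3's exhaustive-switch/never pattern, sketched in isolation (the stub return values are placeholders; the real strategies live in their own modules):

```typescript
type ComputeType = 'agentcore' | 'ecs';

interface ComputeStrategy {
  readonly type: ComputeType;
}

function resolveComputeStrategy(computeType: ComputeType): ComputeStrategy {
  switch (computeType) {
    case 'agentcore':
      return { type: 'agentcore' };
    case 'ecs':
      return { type: 'ecs' };
    default: {
      // If a new member is added to ComputeType without a matching case,
      // `computeType` no longer narrows to `never` and this fails to compile.
      const unreachable: never = computeType;
      throw new Error(`Unknown compute type: ${String(unreachable)}`);
    }
  }
}
```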
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy
interface, using the existing durable Lambda orchestrator. No separate
stacks or Step Functions — ECS is a strategy option alongside AgentCore.

Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks
  state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate
  task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only),
  CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies
  (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN,
  iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside
  existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests,
  5 orchestrator ECS tests. All 563 tests pass.

Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag,
agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
The ECS container's default CMD starts uvicorn server:app which waits
for HTTP POST to /invocations — but in standalone ECS nobody sends that
request, leaving the agent idle. Override the container command to invoke
entrypoint.run_task() directly with the full orchestrator payload via
AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task
definition base environment.
Add a third compute backend (EC2 fleet with SSM Run Command) alongside
the existing AgentCore and ECS strategies. This provides maximum
flexibility with no image size limits, configurable instance types
(including GPU), and full control over the compute environment.

New files:
- ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for
  instance tracking and SSM RunShellScript for task dispatch
- ec2-agent-fleet.ts: CDK construct with ASG, launch template,
  security group, S3 payload bucket, and IAM role
- ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage

Wiring:
- repo-config.ts: add 'ec2' to ComputeType, add instance_type field
- compute-strategy.ts: add EC2 SessionHandle variant and resolver case
- task-orchestrator.ts: add ec2Config prop with env vars and IAM grants
- orchestrate-task.ts: enable compute polling for EC2
- cancel-task.ts: add SSM CancelCommand for EC2 tasks
- task-api.ts: add ssm:CancelCommand permission for cancel Lambda
- agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
@MichaelWalker-git
Contributor Author

Recreating from a clean branch off main to avoid conflicts from prior commits


Copilot AI left a comment


Pull request overview

This PR adds a third compute backend (“EC2 fleet”) alongside the existing AgentCore and ECS options, wiring strategy selection into the orchestrator and extending CDK constructs/tests/docs to support the new backend.

Changes:

  • Introduces compute-strategy abstraction with implementations for AgentCore, ECS Fargate, and EC2 fleet (SSM Run Command + S3 payload).
  • Updates orchestrator start-session and polling to use the selected compute strategy and persist compute metadata for cancellation.
  • Adds CDK constructs/tests for ECS agent cluster and EC2 agent fleet, plus small docs/CI updates.

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 9 comments.

File Description
yarn.lock Adds AWS SDK clients (EC2/ECS/S3/SSM) and transitive deps.
docs/src/content/docs/design/Architecture.md Adds rationale section on separating orchestrator vs agent loops.
docs/design/ARCHITECTURE.md Same rationale section mirrored into top-level design doc.
cdk/test/handlers/start-session-composition.test.ts Integration-style orchestration step composition tests.
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts Unit tests for AgentCore compute strategy.
cdk/test/handlers/shared/strategies/ecs-strategy.test.ts Unit tests for ECS compute strategy.
cdk/test/handlers/shared/strategies/ec2-strategy.test.ts Unit tests for EC2 compute strategy.
cdk/test/handlers/shared/preflight.test.ts Normalizes compute_type casing in tests.
cdk/test/handlers/shared/compute-strategy.test.ts Tests strategy resolution for agentcore/ecs/ec2.
cdk/test/handlers/orchestrate-task.test.ts Removes older startSession tests now handled by strategies/composition tests.
cdk/test/handlers/cancel-task.test.ts Adds ECS cancellation coverage and behavior tests.
cdk/test/constructs/task-orchestrator.test.ts Adds ECS env var + IAM wiring tests.
cdk/test/constructs/task-api.test.ts Adds cancel-task ECS env var + IAM wiring tests.
cdk/test/constructs/ecs-agent-cluster.test.ts New tests for ECS cluster construct.
cdk/test/constructs/ec2-agent-fleet.test.ts New tests for EC2 fleet construct.
cdk/src/stacks/agent.ts Adds commented wiring blocks for ECS/EC2 backends.
cdk/src/handlers/shared/types.ts Persists compute_type + compute_metadata on task records.
cdk/src/handlers/shared/strategies/agentcore-strategy.ts New AgentCore compute strategy implementation.
cdk/src/handlers/shared/strategies/ecs-strategy.ts New ECS compute strategy implementation.
cdk/src/handlers/shared/strategies/ec2-strategy.ts New EC2 compute strategy implementation.
cdk/src/handlers/shared/repo-config.ts Adds ComputeType union + instance_type config field.
cdk/src/handlers/shared/orchestrator.ts Adds PollState fields + instance_type wiring; removes old startSession helper.
cdk/src/handlers/shared/compute-strategy.ts New strategy interface + resolver.
cdk/src/handlers/orchestrate-task.ts Uses compute strategies for start + compute-level polling.
cdk/src/handlers/cancel-task.ts Adds ECS StopTask + EC2 SSM CancelCommand cancellation paths.
cdk/src/constructs/task-orchestrator.ts Adds optional ECS/EC2 config env vars + IAM grants.
cdk/src/constructs/task-api.ts Adds optional ECS/EC2 cancellation wiring + IAM grants.
cdk/src/constructs/ecs-agent-cluster.ts New ECS cluster construct (Fargate task def + SG + IAM).
cdk/src/constructs/ec2-agent-fleet.ts New EC2 ASG-based fleet construct (SSM-managed instances).
cdk/src/constructs/blueprint.ts Extends blueprint compute type to include ec2.
cdk/package.json Adds AWS SDK clients needed for new strategies.
.gitignore Ignores local-docs directory.
.github/workflows/build.yml Sets GitHub token env vars for CI tools; adjusts MISE_DISABLE_TOOLS.


Comment on lines +99 to +108
// 3. Tag instance as busy
await getEc2Client().send(new CreateTagsCommand({
  Resources: [instanceId],
  Tags: [
    { Key: 'bgagent:status', Value: 'busy' },
    { Key: 'bgagent:task-id', Value: taskId },
  ],
}));

// 4. Build the boot command (mirrors ECS strategy env vars and Python boot command)

Copilot AI Apr 14, 2026


The instance is tagged bgagent:status=busy before the SSM command is dispatched, but if SendCommand throws or returns no CommandId, the instance will remain stuck in busy (and with bgagent:task-id set). Wrap the dispatch in a try/catch/finally that reverts tags on failure, or tag busy only after a successful SendCommand response.

Suggested change — drop the early tagging block and renumber the following comment:

// 3. Build the boot command (mirrors ECS strategy env vars and Python boot command)
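The revert pattern this review comment asks for can be sketched with the AWS calls injected as plain functions. `dispatchWithTagRollback` is an illustrative name, not the repo's code; the real handler would wrap the SDK's `SendCommand` and `CreateTags` calls.

```typescript
type SendCommandFn = () => Promise<string | undefined>; // resolves to CommandId
type TagInstanceFn = (status: 'busy' | 'idle') => Promise<void>;

async function dispatchWithTagRollback(
  sendCommand: SendCommandFn,
  tagInstance: TagInstanceFn,
): Promise<string> {
  await tagInstance('busy');
  try {
    const commandId = await sendCommand();
    if (!commandId) throw new Error('SendCommand returned no CommandId');
    return commandId;
  } catch (err) {
    // Revert the claim so the instance returns to the idle pool.
    await tagInstance('idle').catch(() => undefined);
    throw err;
  }
}
```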
Comment on lines +56 to +76
// The ECS container's default CMD starts the FastAPI server (uvicorn) which
// waits for HTTP POST to /invocations — but in standalone ECS nobody sends
// that request. We override the container command to invoke run_task()
// directly with the full orchestrator payload (including hydrated_context).
// This avoids the server entirely and runs the agent in batch mode.
const payloadJson = JSON.stringify(payload);

const containerEnv = [
  { name: 'TASK_ID', value: taskId },
  { name: 'REPO_URL', value: String(payload.repo_url ?? '') },
  ...(payload.prompt ? [{ name: 'TASK_DESCRIPTION', value: String(payload.prompt) }] : []),
  ...(payload.issue_number ? [{ name: 'ISSUE_NUMBER', value: String(payload.issue_number) }] : []),
  { name: 'MAX_TURNS', value: String(payload.max_turns ?? 100) },
  ...(payload.max_budget_usd !== undefined ? [{ name: 'MAX_BUDGET_USD', value: String(payload.max_budget_usd) }] : []),
  ...(blueprintConfig.model_id ? [{ name: 'ANTHROPIC_MODEL', value: blueprintConfig.model_id }] : []),
  ...(blueprintConfig.system_prompt_overrides ? [{ name: 'SYSTEM_PROMPT_OVERRIDES', value: blueprintConfig.system_prompt_overrides }] : []),
  { name: 'CLAUDE_CODE_USE_BEDROCK', value: '1' },
  // Full orchestrator payload as JSON — the Python wrapper reads this to
  // call run_task() with all fields including hydrated_context.
  { name: 'AGENT_PAYLOAD', value: payloadJson },
  ...(payload.github_token_secret_arn

Copilot AI Apr 14, 2026


This strategy serializes the full orchestrator payload (including hydrated_context) into an ECS environment variable (AGENT_PAYLOAD). ECS task overrides have fairly small limits on environment size, so larger contexts can cause RunTask to fail at runtime. Consider switching to an S3-backed payload (pass an S3 URI/key in env) or another mechanism that doesn't depend on env-var size limits.
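
One way to cut over, sketched under assumed names: inline small payloads, hand off larger ones via an S3 pointer. `AGENT_PAYLOAD_S3_URI` and the threshold are illustrative, not the repo's API; the ECS RunTask overrides structure is capped at roughly 8 KiB, so the limit leaves headroom.

```typescript
// Conservative inline limit well under the RunTask overrides cap.
const MAX_INLINE_PAYLOAD_BYTES = 4096;

function buildPayloadEnv(
  payloadJson: string,
  payloadS3Uri: string,
): { name: string; value: string } {
  if (new TextEncoder().encode(payloadJson).length <= MAX_INLINE_PAYLOAD_BYTES) {
    return { name: 'AGENT_PAYLOAD', value: payloadJson };
  }
  // The container fetches the payload itself; env carries only a pointer.
  return { name: 'AGENT_PAYLOAD_S3_URI', value: payloadS3Uri };
}
```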

Comment on lines +145 to +168
} else if (computeType === 'ec2') {
  // EC2-backed task — cancel the SSM command
  const commandId = record.compute_metadata?.commandId;
  const instanceId = record.compute_metadata?.instanceId;
  if (commandId) {
    try {
      await ssmClient.send(new CancelCommandCommand({
        CommandId: commandId,
        ...(instanceId && { InstanceIds: [instanceId] }),
      }));
      logger.info('SSM CancelCommand invoked after cancel', { task_id: taskId, command_id: commandId, request_id: requestId });
    } catch (stopErr) {
      logger.warn('SSM CancelCommand failed after cancel (command may already be done)', {
        task_id: taskId,
        request_id: requestId,
        error: stopErr instanceof Error ? stopErr.message : String(stopErr),
      });
    }
  } else {
    logger.warn('EC2 task cancel skipped: missing commandId in compute_metadata', {
      task_id: taskId,
      request_id: requestId,
    });
  }

Copilot AI Apr 14, 2026


For compute_type: 'ec2', cancel currently only calls SSM CancelCommand. If the command is cancelled mid-script, the cleanup/tag-reset section in the boot script may never run, leaving the instance stuck in bgagent:status=busy and effectively reducing fleet capacity. Consider also re-tagging the instance back to idle here (and deleting bgagent:task-id), or invoking the EC2 strategy’s stopSession logic from the cancel handler.
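
A sketch of that cleanup, with the SSM/EC2 calls injected so the ordering is visible. `cancelEc2Task` and `CancelDeps` are illustrative names; the real handler would call the SDK clients directly.

```typescript
interface CancelDeps {
  cancelCommand: (commandId: string) => Promise<void>;
  setInstanceTags: (instanceId: string, tags: Record<string, string>) => Promise<void>;
}

async function cancelEc2Task(
  deps: CancelDeps,
  commandId: string,
  instanceId?: string,
): Promise<void> {
  // Best-effort cancel — the command may already have finished.
  await deps.cancelCommand(commandId).catch(() => undefined);
  if (instanceId) {
    // Re-tag idle regardless of cancel outcome so the instance rejoins the
    // fleet even when the boot script's own cleanup section never ran.
    await deps.setInstanceTags(instanceId, {
      'bgagent:status': 'idle',
      'bgagent:task-id': '',
    });
  }
}
```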

Comment on lines +179 to +235
let consecutiveEcsPollFailures = 0;
let consecutiveEcsCompletedPolls = 0;

// ECS compute-level crash detection: if DDB is not terminal, check ECS task status
if (
  ddbState.lastStatus &&
  !TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
  computeStrategy
) {
  try {
    const ecsStatus = await computeStrategy.pollSession(sessionHandle);
    if (ecsStatus.status === 'failed') {
      const errorMsg = 'error' in ecsStatus ? ecsStatus.error : 'ECS task failed';
      logger.warn('ECS task failed before DDB terminal write', {
        task_id: taskId,
        error: errorMsg,
      });
      await failTask(taskId, ddbState.lastStatus, `ECS container failed: ${errorMsg}`, task.user_id, true);
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    if (ecsStatus.status === 'completed') {
      consecutiveEcsCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
      if (consecutiveEcsCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
        // ECS task exited successfully but DDB never reached terminal — the agent
        // likely crashed after container exit code 0 but before writing status.
        logger.error('ECS task completed but DDB never caught up — failing task', {
          task_id: taskId,
          consecutive_completed_polls: consecutiveEcsCompletedPolls,
        });
        await failTask(taskId, ddbState.lastStatus, `ECS task exited successfully but agent never wrote terminal status after ${consecutiveEcsCompletedPolls} polls`, task.user_id, true);
        return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
      }
      logger.warn('ECS task completed but DDB not terminal — waiting for DDB catchup', {
        task_id: taskId,
        consecutive_completed_polls: consecutiveEcsCompletedPolls,
      });
    }
  } catch (err) {
    consecutiveEcsPollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
    if (consecutiveEcsPollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
      logger.error('ECS pollSession failed repeatedly — failing task', {
        task_id: taskId,
        consecutive_failures: consecutiveEcsPollFailures,
        error: err instanceof Error ? err.message : String(err),
      });
      await failTask(taskId, ddbState.lastStatus, `ECS poll failed ${consecutiveEcsPollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`, task.user_id, true);
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    logger.warn('ECS pollSession check failed (non-fatal)', {
      task_id: taskId,
      consecutive_failures: consecutiveEcsPollFailures,
      error: err instanceof Error ? err.message : String(err),
    });
  }
}

return { ...ddbState, consecutiveEcsPollFailures, consecutiveEcsCompletedPolls };

Copilot AI Apr 14, 2026


The compute-level polling block is gated on blueprintConfig.compute_type === 'ecs' || 'ec2', but the variables/log messages/errors are all ECS-specific (e.g., consecutiveEcs*, "ECS container failed"). This will produce misleading failures for EC2 and makes the logic harder to extend. Consider renaming this to backend-neutral terminology and branching error messages based on sessionHandle.strategyType (or splitting ECS vs EC2 polling).

Suggested change — the quoted block above is replaced wholesale with backend-neutral naming:

let consecutiveComputePollFailures = 0;
let consecutiveComputeCompletedPolls = 0;
const computeBackendLabel = sessionHandle.strategyType === 'ec2' ? 'EC2' : 'ECS';

// Compute-level crash detection: if DDB is not terminal, check compute session status.
if (
  ddbState.lastStatus &&
  !TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
  computeStrategy
) {
  try {
    const computeStatus = await computeStrategy.pollSession(sessionHandle);
    if (computeStatus.status === 'failed') {
      const errorMsg =
        'error' in computeStatus ? computeStatus.error : `${computeBackendLabel} task failed`;
      logger.warn(`${computeBackendLabel} task failed before DDB terminal write`, {
        task_id: taskId,
        error: errorMsg,
      });
      await failTask(
        taskId,
        ddbState.lastStatus,
        `${computeBackendLabel} compute failed: ${errorMsg}`,
        task.user_id,
        true,
      );
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    if (computeStatus.status === 'completed') {
      consecutiveComputeCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
      if (consecutiveComputeCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
        // Compute session exited successfully but DDB never reached terminal —
        // the agent likely crashed after compute completion but before writing status.
        logger.error(`${computeBackendLabel} task completed but DDB never caught up — failing task`, {
          task_id: taskId,
          consecutive_completed_polls: consecutiveComputeCompletedPolls,
        });
        await failTask(
          taskId,
          ddbState.lastStatus,
          `${computeBackendLabel} task exited successfully but agent never wrote terminal status after ${consecutiveComputeCompletedPolls} polls`,
          task.user_id,
          true,
        );
        return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
      }
      logger.warn(`${computeBackendLabel} task completed but DDB not terminal — waiting for DDB catchup`, {
        task_id: taskId,
        consecutive_completed_polls: consecutiveComputeCompletedPolls,
      });
    }
  } catch (err) {
    consecutiveComputePollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
    if (consecutiveComputePollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
      logger.error(`${computeBackendLabel} pollSession failed repeatedly — failing task`, {
        task_id: taskId,
        consecutive_failures: consecutiveComputePollFailures,
        error: err instanceof Error ? err.message : String(err),
      });
      await failTask(
        taskId,
        ddbState.lastStatus,
        `${computeBackendLabel} poll failed ${consecutiveComputePollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`,
        task.user_id,
        true,
      );
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    logger.warn(`${computeBackendLabel} pollSession check failed (non-fatal)`, {
      task_id: taskId,
      consecutive_failures: consecutiveComputePollFailures,
      error: err instanceof Error ? err.message : String(err),
    });
  }
}

return {
  ...ddbState,
  consecutiveEcsPollFailures: consecutiveComputePollFailures,
  consecutiveEcsCompletedPolls: consecutiveComputeCompletedPolls,
};
Comment on lines +283 to +315
// EC2 fleet compute strategy permissions (only when EC2 is configured)
if (props.ec2Config) {
  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'ec2:DescribeInstances',
      'ec2:CreateTags',
    ],
    resources: ['*'],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'ssm:SendCommand',
      'ssm:GetCommandInvocation',
      'ssm:CancelCommand',
    ],
    resources: ['*'],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: ['s3:PutObject'],
    resources: [`arn:${Aws.PARTITION}:s3:::${props.ec2Config.payloadBucketName}/*`],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: ['iam:PassRole'],
    resources: [props.ec2Config.instanceRoleArn],
    conditions: {
      StringEquals: {
        'iam:PassedToService': 'ec2.amazonaws.com',
      },
    },
  }));

Copilot AI Apr 14, 2026


The EC2 orchestrator IAM statements are very broad (ec2:CreateTags on *, ssm:* on *) and the iam:PassRole grant conditioned on ec2.amazonaws.com doesn’t appear to be used by the EC2 strategy (it doesn’t call APIs that pass a role). Consider removing the EC2 iam:PassRole statement, and scoping EC2 tagging permissions with conditions (e.g., ec2:ResourceTag/${fleetTagKey} == fleetTagValue and limiting aws:TagKeys) to reduce blast radius.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +116
 */
readonly ec2FleetConfig?: {
  readonly instanceRoleArn: string;
};

Copilot AI Apr 14, 2026


ec2FleetConfig includes instanceRoleArn, but that value isn’t used (the construct only conditionally grants ssm:CancelCommand on *). This makes the API misleading and suggests a scoping that doesn’t exist. Either remove instanceRoleArn from the props or use it to implement a more scoped permission model (e.g., document/service role, or additional permissions needed to reset instance tags on cancel).

Suggested change — drop the unused field and document the prop as a bare enablement flag:

 * This object currently serves only as an enablement flag for EC2-backed cancellation.
 */
readonly ec2FleetConfig?: {};

// Tag the ASG instances for fleet identification
// CDK auto-propagates tags from the ASG to instances
this.autoScalingGroup.node.defaultChild;

Copilot AI Apr 14, 2026


this.autoScalingGroup.node.defaultChild; is a no-op statement and can be removed. If the intent was to access/modify the underlying CFN resource, assign it to a variable and use it explicitly (otherwise this line adds noise without effect).

Suggested change — delete the no-op line:

this.autoScalingGroup.node.defaultChild;

'',
'# Set environment variables',
...envExports,
'',

Copilot AI Apr 14, 2026


The EC2 boot script uses aws ecr get-login-password --region $AWS_REGION, but AWS_REGION is never set in the script. On stock AL2023 instances this is typically unset, which will make ECR login/pull fail. Consider deriving the region from instance metadata (like the user-data does) and exporting AWS_REGION/AWS_DEFAULT_REGION before the AWS CLI and docker run steps.

Suggested change — resolve the region from instance metadata before the env exports:

'',
'# Resolve AWS region from instance metadata',
'IMDS_TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")',
'AWS_REGION=$(curl -sS -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | sed -n \'s/.*"region"[[:space:]]*:[[:space:]]*"\\([^"]*\\)".*/\\1/p\')',
'export AWS_REGION',
'export AWS_DEFAULT_REGION="$AWS_REGION"',
'',
Comment on lines +127 to +130
'# Fetch payload from S3',
`aws s3 cp "s3://${EC2_PAYLOAD_BUCKET}/${payloadKey}" /tmp/payload.json`,
'export AGENT_PAYLOAD=$(cat /tmp/payload.json)',
'',

Copilot AI Apr 14, 2026


export AGENT_PAYLOAD=$(cat /tmp/payload.json) is unsafe: command substitution strips newlines and can mangle JSON, and large hydrated payloads may exceed shell/env-var limits. Prefer keeping the payload as a file (e.g., mount /tmp/payload.json into the container and have the Python entrypoint read it) or at least quote the assignment to preserve content.
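
The file-based handoff this comment suggests, sketched from the container side. `AGENT_PAYLOAD_FILE` is an assumed variable name, and the real entrypoint is Python; TypeScript is used here for consistency with the rest of the PR.

```typescript
import { readFileSync } from 'node:fs';

function loadAgentPayload(env: Record<string, string | undefined>): unknown {
  const file = env.AGENT_PAYLOAD_FILE;
  if (file) {
    // A mounted file round-trips arbitrary JSON with no shell quoting,
    // newline stripping, or env-var size limits.
    return JSON.parse(readFileSync(file, 'utf8'));
  }
  if (env.AGENT_PAYLOAD) {
    return JSON.parse(env.AGENT_PAYLOAD); // legacy inline path
  }
  throw new Error('No agent payload provided');
}
```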
